Type conversion unit in a multiprocessor system

ABSTRACT

The invention relates to a very large instruction word (VLIW) processor, comprising a plurality of execution units (101, 103, 105), a register file (109, 111, 113) and a communication network (117) for coupling the execution units and the register file. In case of an application specific VLIW processor, i.e. a VLIW processor designed for handling a specific range of applications, the communication network of the VLIW processor may not support all types of data conversions. Therefore, it may turn out that a certain data type conversion is not possible for some applications to be run on such a VLIW processor. By incorporating a type conversion unit (107) in the architecture of the VLIW processor, it can be guaranteed that any desired data type conversion can be performed. In case of a partially connected communication network (117), a communication device (129) can be incorporated in the architecture as well, allowing every execution unit to transfer a value to the type conversion unit, and allowing the type conversion unit to transfer a value to any segment of the distributed register file.

The invention relates to a processor comprising a plurality of execution units, a register file accessible by the execution units and a communication network for coupling the execution units and the register file.

The ongoing demand for an increase in high performance computing has led to the introduction of several solutions in which some form of concurrent processing, e.g. parallelism, has been introduced into the processor architecture. A widely used concept to achieve high performance is the introduction of instruction level parallelism, in which a number of execution units are present in the processor architecture for executing a number of instructions more or less at the same time. Two main concepts have been adopted: the multithreading concept, in which several threads of a program are accessible by the execution units, and the very large instruction word (VLIW) concept, in which bundles of instructions corresponding with the functionality of the execution units are present in the instruction set.

In case of a Very Large Instruction Word (VLIW) processor, multiple instructions are packaged into one long instruction, a so-called VLIW instruction. A VLIW processor uses multiple, independent execution units to execute these multiple instructions in parallel. The processor allows exploiting instruction-level parallelism in programs and thus executing more than one instruction at a time. In order for a software program to run on a VLIW processor, it must be translated into a set of VLIW instructions. The compiler attempts to minimize the time needed to execute the program by optimizing parallelism. The compiler combines instructions into a VLIW instruction under the constraint that the instructions assigned to a single VLIW instruction can be executed in parallel and under data dependency constraints. Encoding of instructions can be done in two different ways, for a data stationary VLIW processor or for a time stationary VLIW processor, respectively. In case of a data stationary VLIW processor all information related to a given pipeline of operations to be performed on a given data item is encoded in a single VLIW instruction. For time stationary VLIW processors, the information related to a pipeline of operations to be performed on a given data item is spread over multiple instructions in different VLIW instructions, thereby exposing said pipeline of the processor in the program.

In most high-level programming languages multiple data types can be used. In programs using C as the programming language, a data type is often implicitly converted or explicitly cast to another data type. When executing the program, the actual type conversion may then be performed in the network of the VLIW processor, or at the output of an execution unit. In case of an application specific VLIW processor, i.e. a VLIW processor designed for handling a specific range of applications, the network of the VLIW processor or the execution unit may not provide the required type conversion hardware for all data type conversions. Therefore, it may turn out that a certain data type conversion cannot be performed for some applications to be run on such a VLIW processor.

U.S. Pat. No. 6,460,135 describes a microprocessor comprising an input/output execution unit, a calculation execution unit, a plurality of data registers, an instruction controller and an interconnect structure. The instruction controller decodes the instruction word and sends the operation code to the input/output execution unit or the calculation execution unit. Type information registers are associated with the data registers, and each type information register holds the type information indicating the data type and the effective bit width of the data stored in the corresponding data register. The instruction word designates the type information of the execution result, i.e. the data type and the effective bit width, independently of the type information of the data used for the calculation. During execution of an operation requiring two operands, the calculation execution unit compares the type information of the two operands. In case a disagreement exists, an interrupt is generated and the data is subsequently converted to the correct type in software. In case the input/output execution unit has to execute an input/output instruction, it compares the type information stored in the type information register with that of the instruction word. In case of disagreement, an interrupt is generated as well and the data is again converted to the correct type in software.

It is a disadvantage of the prior art processor that an interrupt has to be generated in order to initiate the type conversion, which subsequently has to be performed in software. As a result, the overall performance of the processor may decrease substantially.

It is an object of the invention to increase the range of data type conversions that can be performed in an application specific multiprocessor system, and more in particular in an application specific VLIW processor, increasing the flexibility of those systems.

This object is achieved with a processor of the kind set forth, characterized in that the processor further comprises a conversion device for converting the type of data when transferring said data between an execution unit of the plurality of execution units and the register file. In case the communication network does not support the required data type conversion, the type conversion can be performed by the conversion device. By allowing the conversion device to perform a broad range of type conversions, the flexibility of an application specific multiprocessor system can be increased since different applications, i.e. applications outside the original range of applications, can be run on the multiprocessor system as well.

An embodiment of the invention is characterized in that the register file is a distributed register file, and that the communication network is a partially connected communication network for coupling the execution units and selected parts of the distributed register file. An advantage of a distributed register file is that it requires fewer read and write ports per register file segment, resulting in a smaller register file bandwidth. Furthermore, the addressing of a register in a distributed register file requires fewer bits when compared to a central register file. A partially connected communication network is less expensive in terms of code size and power consumption, when compared to a fully connected communication network, especially in case of a large number of execution units.

An embodiment of the invention is characterized in that the conversion device comprises a conversion register file and a conversion unit, the conversion register file being accessible by the conversion unit. In case the result of an execution unit has to be written to several registers of the register file, but with a different data type, the data can be written to the conversion register file. Subsequently, the conversion unit can read the data from the conversion register file, convert the data into the required type, and write the results to the appropriate register, for each request.

An embodiment of the invention is characterized in that the processor further comprises a communication device for coupling the execution units, the conversion unit, the distributed register file, and the conversion register file. In case of a partially connected communication network, it cannot be guaranteed that there exists a communication path from every execution unit or type conversion unit output to every execution unit or type conversion unit input. As a result, an execution unit may not be able to transfer data to the conversion unit. The communication device allows transferring data from the execution unit output to the conversion unit, and also from the conversion unit to the execution unit input, in case this is not possible via the communication network.

An embodiment of the invention is characterized in that the communication device supports all data types of a programming language. An advantage of this embodiment is that all data can be transferred to the conversion device for data type conversion, independent of its type and without requiring any intermediate conversion by the communication network or the communication device itself.

An embodiment of the invention is characterized in that the communication device couples all execution units, the conversion unit, all parts of the distributed register file, and the conversion register file. An advantage of this embodiment is that all execution units can transfer data to the conversion register file via the communication device, and that the conversion unit can always transfer data to all register file segments via the communication device.

An embodiment of the invention is characterized in that the conversion unit is part of one of the execution units of the plurality of execution units. An advantage of this embodiment is that no separate conversion unit is required, saving additional silicon area as well as communication connections.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 shows a processor, comprising a plurality of execution units, according to the invention.

Referring to FIG. 1, a schematic block diagram illustrates a VLIW processor, comprising a plurality of execution units 101, 103 and 105, and a distributed register file, including the register file segments 109, 111, 113. The processor also has a conversion device 135. Conversion device 135 comprises conversion register file 115 and type conversion unit 107. Register file segment 109 is accessible by execution units 101 and 103, register file segments 111 and 113 are accessible by execution unit 105 and conversion register file 115 is accessible by type conversion unit 107.

The processor also has a partially connected network 117 for coupling the execution units 101, 103 and 105, and selections of distributed register file segments 109, 111, 113 and conversion register file 115. The partially connected network 117 also couples conversion device 135 with selected distributed register file segments 109, 111 and 113. The partially connected network 117 comprises the multiplexers 119, 121, 123, 125 and 127. The processor handles a specific range of applications, and the partially connected network 117 is designed for this purpose, i.e. during design of the processor a connection from an execution unit to a distributed register file segment is made via the partially connected network, if that execution unit has to write values into that register file segment during execution of an application within that range. Especially in case of a large number of execution units, connecting all execution units to all distributed register file segments via a direct connection will be too expensive in terms of silicon area and multiplexing overhead. The connections between the execution units 101, 103 and 105 and the conversion register file 115, as well as the connections between the type conversion unit 107 and distributed register file segments 109, 111 and 113, all being part of the partially connected network 117, are also fixed at design time. The partially connected network 117 also supports a number of data type conversions itself, and which type conversions are supported is fixed during design of the processor as well.

During execution of an application by the processor, data type conversions will have to be performed by the processor. For example, execution unit 101 produces an output in the form of an unsigned fixed point number, comprising 16 bits of which 15 bits are positioned behind the decimal point, that has to be written to register file segment 111, via the partially connected network 117. Execution unit 105 will use this data output as input for an operation, but this input is required to be an unsigned fixed point number, comprising 32 bits of which 31 bits are positioned behind the decimal point. Therefore, the type of the data will have to be converted. In this case, the partially connected network supports this data type conversion, and the unsigned fixed point number comprising 16 bits is implicitly converted by the multiplexer 123 to an unsigned fixed point number comprising 32 bits.
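By way of illustration only, the implicit widening in this example can be sketched in Python. The function names are hypothetical and do not appear in the embodiment, and the sketch assumes that the conversion amounts to moving the 16 source bits into the upper half of the 32-bit word, so that the represented value is unchanged.

```python
def widen_ufixed16_to_ufixed32(raw16: int) -> int:
    """Widen a 16-bit unsigned fixed point word (15 fractional bits)
    to a 32-bit unsigned fixed point word (31 fractional bits)."""
    if not 0 <= raw16 <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned value")
    # 31 - 15 = 16 extra fractional bits, so shift left by 16 positions.
    return (raw16 << 16) & 0xFFFFFFFF


def represented_value(raw: int, frac_bits: int) -> float:
    """Return the real number encoded by a raw fixed point word."""
    return raw / (1 << frac_bits)
```

Under these assumptions the raw word 0x4000 (0.5 with 15 fractional bits) becomes 0x40000000, which still represents 0.5 with 31 fractional bits.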

When executing an application that is outside the range for which the processor is originally designed, it may turn out that a required data type conversion cannot be performed implicitly by the processor. For example, execution unit 103 produces a data output in the form of an unsigned fixed point number, comprising 16 bits, that should be written to register file segment 113, via the partially connected network 117. Execution unit 105 requires these data as input data for an operation, as a floating point number comprising 32 bits. However, multiplexer 125 is not capable of converting the type of the data from an unsigned fixed point number to a floating point number. In this case, execution unit 103 writes the data to conversion register file 115, via the partially connected network 117. Type conversion unit 107 reads the data from conversion register file 115, and this unit converts the type of the data from unsigned fixed point number to floating point number, by executing a dedicated instruction. Subsequently, type conversion unit 107 writes the data in the form of a floating point number to register file segment 113, via the partially connected network 117. Now the data are available in the correct data type for execution unit 105.
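The effect of such a dedicated conversion instruction can be sketched as follows. This is a minimal software model, not the hardware of the embodiment: the function name is hypothetical, the 32-bit floating point format is assumed to be IEEE 754 single precision, and the source word is assumed to carry 15 fractional bits, neither of which is stated for this example.

```python
import struct


def ufixed16_to_float32_bits(raw16: int, frac_bits: int = 15) -> int:
    """Convert a 16-bit unsigned fixed point word to the bit pattern of
    a 32-bit float (assumed: IEEE 754 single precision)."""
    if not 0 <= raw16 <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned value")
    value = raw16 / (1 << frac_bits)
    # Pack as a single-precision float, then reread the same 4 bytes
    # as an unsigned integer to obtain the register contents.
    return struct.unpack("<I", struct.pack("<f", value))[0]
```

Every 16-bit source value fits in the 24-bit significand of a single-precision number, so under these assumptions the conversion is exact.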

Another possibility is that during execution of an application, outside the range for which the processor is originally designed, an execution result is used as input data by more than one execution unit, but these execution units require a different data type. Performing the same operation twice, and producing an output result with a different type, is not possible if the execution unit comprises an internal state. For example, in case of a Multiply Accumulation Unit having an internal accumulation register, performing two subsequent identical operations with the same input data will result in a different output result. For example, execution unit 105 produces output data in the form of an unsigned fixed point number comprising 32 bits, and these data have to be written twice to register file segment 109, via the partially connected network 117, once as an unsigned fixed point number comprising 16 bits and once as a floating point number comprising 32 bits. Subsequently, these data are required as input data for execution units 101 and 103, respectively. However, the partially connected network 117 cannot perform both of the required data type conversions. Execution unit 105 writes its output data to conversion register file 115, via the partially connected network 117. Type conversion unit 107 reads the data from conversion register file 115, converts the data from an unsigned fixed point number comprising 32 bits to an unsigned fixed point number comprising 16 bits, and writes the converted data to register file segment 109, via the partially connected network 117. Next, type conversion unit 107 reads the data again from conversion register file 115, converts the data from an unsigned fixed point number comprising 32 bits to a floating point number comprising 32 bits, and writes the converted data to register file segment 109, via the partially connected network 117. Subsequently, these data can be read by execution units 101 and 103 from register file segment 109, and used for further processing.
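The following sketch illustrates why the result is stored once and converted twice rather than recomputed. All names are hypothetical. The sketch assumes 31 fractional bits in the 32-bit source word, truncation of the low 16 bits for the narrowing conversion, and IEEE 754 single precision for the floating point result.

```python
import struct

MASK32 = 0xFFFFFFFF


class MultiplyAccumulate:
    """Toy model of an execution unit with an internal accumulation register."""

    def __init__(self) -> None:
        self.acc = 0

    def mac(self, a: int, b: int) -> int:
        # The result depends on the accumulator, so repeating the same
        # operation with the same inputs yields a different output.
        self.acc = (self.acc + a * b) & MASK32
        return self.acc


def to_ufixed16(raw32: int) -> int:
    """Narrow to 16 bits by dropping the low 16 fractional bits (assumed)."""
    return (raw32 >> 16) & 0xFFFF


def to_float32_bits(raw32: int, frac_bits: int = 31) -> int:
    """Convert to a single-precision bit pattern (assumed IEEE 754)."""
    value = raw32 / (1 << frac_bits)
    return struct.unpack("<I", struct.pack("<f", value))[0]


def convert_twice(conversion_register: int) -> tuple:
    """Read the stored value twice and produce both required types."""
    return to_ufixed16(conversion_register), to_float32_bits(conversion_register)
```

In this model, two calls `mac(3, 4)` on a fresh unit return 12 and then 24, whereas `convert_twice` derives both output types from the single value held in the conversion register file.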

For some applications executing on the processor, writing data from the execution units 101, 103 and 105 to the conversion register file 115 or writing data from the type conversion unit 107 to register file segments 109, 111 and 113 may require more than one step. For example, execution unit 101 produces output data of type floating point number, and this data has to be written to register file segment 111 as an unsigned fixed point number, to be used as input data for an operation to be performed by execution unit 105. However, the partially connected network does not support this type conversion. The type conversion can be performed by type conversion unit 107, but execution unit 101 cannot write its output data directly to conversion register file 115, via the partially connected network 117, but only via an alternative route. A possible alternative route is that execution unit 101 writes its output data to register file segment 111, via the partially connected network 117, without implicit data type conversion. Execution unit 105 reads the output data from register file segment 111, and writes these output data to conversion register file 115, via the partially connected network 117. Subsequently, type conversion unit 107 reads the output data from conversion register file 115 and performs the required data type conversion. Type conversion unit 107 is not capable of writing the data directly to register file segment 111, via the partially connected network 117, but only via an alternative route. A possibility is that type conversion unit 107 writes the data to register file segment 109, via the partially connected network 117. Subsequently, the data are read from the register file segment 109 by execution unit 101, which writes the data to register file segment 111, via the partially connected network 117.

In case during compilation of a program the compiler detects that data cannot be written directly by an execution unit to the conversion register file, or by the type conversion unit directly to a register file segment, it will determine an alternative route and insert the required additional instructions in the program.
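One way a compiler might determine such an alternative route is a breadth-first search over the write and read connections. The sketch below is an assumption for illustration, not the method of the embodiment: the labels, the function name and the write connections in `WRITES` are hypothetical and are chosen to match the example above, while `READS` follows the register file accessibility described for FIG. 1.

```python
from collections import deque

# Hypothetical write connections of the partially connected network.
WRITES = {
    "EU101": ["RF111"],
    "EU105": ["CRF115"],
    "TCU107": ["RF109"],
}
# Which units can read which register file (per the description of FIG. 1).
READS = {
    "RF109": ["EU101", "EU103"],
    "RF111": ["EU105"],
    "RF113": ["EU105"],
    "CRF115": ["TCU107"],
}


def find_route(source_unit, target_file):
    """Return the shortest unit/register-file hop sequence, or None."""
    queue = deque([[source_unit]])
    seen = {source_unit}
    while queue:
        path = queue.popleft()
        for reg_file in WRITES.get(path[-1], []):
            if reg_file == target_file:
                return path + [reg_file]
            # Any unit that can read this file may forward the data.
            for reader in READS.get(reg_file, []):
                if reader not in seen:
                    seen.add(reader)
                    queue.append(path + [reg_file, reader])
    return None
```

With these tables, `find_route("EU101", "CRF115")` passes through RF111 and EU105, and `find_route("TCU107", "RF111")` passes through RF109 and EU101, mirroring the two detours in the example. Each intermediate hop corresponds to an additional instruction the compiler would have to insert.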

In case the partially connected network 117 is not capable of performing the desired type conversion, the type conversion unit 107 can perform this type conversion and write the converted data to the proper register file segment via the partially connected network. As a result, the processor can still efficiently execute applications outside the range for which the processor was originally designed, increasing the flexibility of the processor. During compilation of such an application, the compiler will detect that a required data type conversion cannot be performed implicitly by the network, and will introduce additional instructions in the program for sending the data to the type conversion unit 107, via the partially connected network 117, converting the data to the required data type by the type conversion unit 107, and sending the converted data to the required register file segment, via the partially connected network 117. The explicit type conversion performed by the type conversion unit 107 can be implemented by means of one or more operations, as known by the person skilled in the art. For example, when only using unsigned fixed point types, a shift left operation, a shift right operation and an AND operation will suffice. In case of signed fixed point types it should be possible to add bits as most significant bits in case of a shift right operation, in order to prevent a change of the sign bit.
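The shift and AND operations mentioned above can be sketched as follows. The function names and parameters are hypothetical, and the signed variant assumes a two's complement representation, which the description does not state.

```python
def convert_unsigned(raw: int, src_frac: int, dst_frac: int, dst_bits: int) -> int:
    """Unsigned fixed point conversion using shift left, shift right and AND."""
    shift = dst_frac - src_frac
    shifted = raw << shift if shift >= 0 else raw >> -shift
    return shifted & ((1 << dst_bits) - 1)


def convert_signed(raw: int, src_bits: int, src_frac: int,
                   dst_frac: int, dst_bits: int) -> int:
    """Signed conversion (assumed two's complement) that keeps the sign bit."""
    # Reinterpret the raw word as a signed Python integer.
    value = raw - (1 << src_bits) if raw & (1 << (src_bits - 1)) else raw
    shift = dst_frac - src_frac
    # Python's >> on a negative int is an arithmetic shift: copies of the
    # sign bit are shifted in as the most significant bits.
    value = value << shift if shift >= 0 else value >> -shift
    return value & ((1 << dst_bits) - 1)
```

For the 16-bit word 0xC000 with 15 fractional bits (the signed value -0.5), reducing to 7 fractional bits with `convert_signed` yields 0xFFC0, which is still -0.5, whereas the purely logical shift in `convert_unsigned` yields 0x00C0 and loses the sign.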

In another embodiment the communication network 117 may be a fully connected communication network, i.e. all execution units 101, 103 and 105, and type conversion unit 107 are coupled to all distributed register file segments 109, 111 and 113, and the conversion register file 115. In case of a relatively small number of execution units, the overhead of a fully connected communication network will be relatively small.

In alternative embodiments, the processor also comprises a communication device 129 for coupling the functional units 101, 103 and 105, type conversion unit 107, and all distributed register file segments 109, 111 and 113, and conversion register file 115. The communication device 129 shares multiplexers 119, 121, 123, 125 and 127 with the partially connected network 117. The communication device supports all data types of the programming language in which the application to be executed is written.

In some situations, it may turn out that the partially connected network 117 cannot implicitly perform a required type conversion. On top of that, an alternative route for writing the data to conversion register file 115 of type conversion unit 107, or writing the data from type conversion unit 107 to register file segments 109, 111 and 113, may require many steps or may not exist at all. In these cases, the communication device 129 allows transferring values between the execution units 101, 103 and 105, the type conversion unit 107, the distributed register file segments 109, 111 and 113, and the conversion register file 115, in case this is not possible via the partially connected network 117. In this way a communication path between each output of the execution units 101, 103, 105, and type conversion unit 107, and each input of the execution units 101, 103 and 105, and type conversion unit 107 is guaranteed to exist. For instance, execution unit 101 is not directly coupled to conversion register file 115 via the partially connected network 117, but a direct coupling only exists via communication device 129. If possible, however, direct communication between the execution units, type conversion unit and register files via the partially connected network 117 is preferred.

For example, execution unit 101 produces result data as an unsigned fixed point number comprising 32 bits, and these data have to be written to register file segment 111, for subsequent use by execution unit 105, which requires a floating point number as input data. Execution unit 101 cannot write the data directly to register file segment 111 via the partially connected network 117 since it does not support this type of data conversion. Execution unit 101 also cannot write the output data directly to conversion register file 115 via the partially connected network 117, as this connection does not exist. On top of that, type conversion unit 107 cannot write data directly to register file segment 111 via the partially connected network 117 either, since this connection also does not exist. The compiler detects these problems during program compilation, decides to transfer data via the communication device 129, and inserts the appropriate instructions for performing these data transfers in the program. The execution unit 101 writes the output data to conversion register file 115, via communication device 129. Subsequently, the type conversion unit 107 reads the data from conversion register file 115 and converts the type of the data to a floating point number. Subsequently, type conversion unit 107 writes the data to register file segment 111, via communication device 129. In alternative embodiments, data may be written from execution units 101, 103 and 105 to conversion register file 115 via the partially connected network 117, and subsequently from type conversion unit 107 to register file segments 109, 111 and 113 via communication device 129. In another embodiment data may be written from execution units 101, 103 and 105 to conversion register file 115 via communication device 129, and subsequently from type conversion unit 107 to register file segments 109, 111 and 113 via the partially connected network 117.
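The per-transfer choice between the two communication paths can be sketched as follows. This is an illustrative model only: the labels, the function names and the contents of `NETWORK_LINKS` are hypothetical, and the sketch simply prefers the partially connected network whenever a direct connection is assumed to exist.

```python
# Hypothetical direct write connections of the partially connected network.
NETWORK_LINKS = {
    ("EU101", "RF109"),
    ("EU105", "CRF115"),
    ("TCU107", "RF109"),
}


def choose_path(source, destination):
    """Prefer the partially connected network; fall back to the device."""
    if (source, destination) in NETWORK_LINKS:
        return "network 117"
    return "communication device 129"


def plan_conversion(producer, destination):
    """List the two transfers needed around an explicit type conversion."""
    return [
        (producer, "CRF115", choose_path(producer, "CRF115")),
        ("TCU107", destination, choose_path("TCU107", destination)),
    ]
```

Under these assumptions, `plan_conversion("EU101", "RF111")` routes both transfers via the communication device, as in the example, while `plan_conversion("EU105", "RF109")` uses the network for both.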

Preferably, the communication device 129 is arranged for communication with a first latency, and the partially connected communication network 117 is arranged for communication with a second latency, the first latency exceeding the second latency. An advantage of this embodiment is that it prevents the communication via the communication device 129 from being the rate-limiting step, so that it allows the processor to run at maximal clock frequency. Furthermore a high throughput is realized. Usually, the communication device 129 comprises a form of shared communication mechanism. Therefore, the communication via the communication device 129 may be slowed down by its control logic, especially in case of a large number of execution units. Dividing the communication via the communication device into several sequential steps, each of which takes place in one clock cycle, keeps the latency of one communication step low. This prevents the communication via the communication device from limiting the clock frequency of the processor. The total latency of the communication via the communication device, being the sum of the latencies of all separate steps, will be higher than the latency of the communication via the partially connected communication network. However, the higher latency of the communication via the communication device 129 will hardly affect the overall performance of the processor, since the majority of the communication will take place via the partially connected communication network 117.
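The argument that the higher latency hardly affects overall performance can be made concrete with a small calculation. The function name and all numbers used with it are assumptions chosen for illustration; the description gives no cycle counts or traffic shares.

```python
def average_transfer_latency(network_cycles: float,
                             device_cycles: float,
                             device_fraction: float) -> float:
    """Mean latency in cycles when a fraction of transfers uses the device."""
    if not 0.0 <= device_fraction <= 1.0:
        raise ValueError("device_fraction must lie between 0 and 1")
    return ((1.0 - device_fraction) * network_cycles
            + device_fraction * device_cycles)
```

For instance, assuming one cycle via the network, three cycles via the communication device, and 5% of transfers taking the device, the mean is about 1.1 cycles per transfer, only slightly above the network latency.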

In an advantageous embodiment, the communication device 129 comprises a multiplexer 131 and a global bus 133, the multiplexer being arranged for coupling the functional units 101, 103 and 105, type conversion unit 107, and the global bus 133, the global bus 133 being arranged for coupling the multiplexer 131 and all distributed register file segments 109, 111 and 113, and conversion register file 115. The global bus 133 differs from the partially connected communication network 117 in that multiple functional units 101, 103 and 105, and type conversion unit 107 are coupled to the global bus 133 and these functional units and type conversion unit time-multiplex the global bus, whereas the partially connected communication network 117 couples one execution unit or the conversion unit to a register file segment or the conversion register file. An advantage of a global bus is that the overhead in terms of silicon area is relatively low when compared to a fully connected communication network.

The execution units or type conversion unit can be coupled to one register file segment, as in case of type conversion unit 107, or to multiple register file segments, as in case of execution unit 105, or multiple functional units may be coupled to one register file segment, as in case of the functional units 101 and 103. The register file segments can be coupled to one execution unit, as in case of conversion register file 115, or to multiple execution units, as in case of register file segment 109. The degree of coupling between the register file segments and the execution units can depend on the type of operations that the execution unit has to perform.

In the embodiment shown in FIG. 1, the partially connected network 117 and the communication device 129 share some resources, such as the multiplexers 119, 121, 123, 125 and 127. In other embodiments even more resources may be shared, or no resources are shared.

In other embodiments, the type conversion unit 107 may be part of one of the execution units 101, 103 and 105, with the conversion register file 115 being part of the corresponding register file segment of that execution unit.

A superscalar processor also comprises multiple issue slots that can perform multiple operations in parallel, as in case of a VLIW processor. However, the processor hardware itself determines at runtime which operation dependencies exist and decides which operations to execute in parallel based on these dependencies, while ensuring that no resource conflicts will occur. The principles of the embodiments for a VLIW processor, described in this section, also apply for a superscalar processor. In general, a VLIW processor may have more execution units in comparison to a superscalar processor. The hardware of a VLIW processor is less complicated in comparison to a superscalar processor, which results in a better scalable architecture. The number of execution units and the complexity of each execution unit, among other things, will determine the amount of benefit that can be reached using the present invention.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim. The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

1. A processor comprising: a plurality of execution units (101, 103, 105); a register file (109, 111, 113) accessible by the execution units; a communication network (117) for coupling the execution units and the register file, characterized in that the processor further comprises a conversion device (135) for converting the type of data when transferring said data between an execution unit of the plurality of execution units and the register file.
2. A processor according to claim 1, wherein: the register file (109, 111, 113) is a distributed register file; the communication network (117) is a partially connected communication network for coupling the execution units and selected parts of the distributed register file.
3. A processor according to claim 2, wherein: the conversion device (135) comprises a conversion register file (115) and a conversion unit (107), the conversion register file being accessible by the conversion unit.
4. A processor according to claim 3, characterized in that the processor further comprises a communication device (129) for coupling the execution units (101, 103, 105), the conversion unit (107), the distributed register file (109, 111, 113), and the conversion register file (115).
5. A processor according to claim 4, characterized in that the communication device (129) supports all data types of a programming language.
6. A processor according to claim 4, characterized in that the communication device (129) couples all execution units (101, 103, 105), the conversion unit (107), all parts of the distributed register file (109, 111, 113), and the conversion register file (115).
7. A processor according to claim 6, characterized in that the conversion unit (107) is part of one of the execution units of the plurality of execution units (101, 103, 105).